
    Memory Migration on Next-Touch

    NUMA capabilities such as explicit migration of memory buffers enable flexible placement of data buffers at runtime, near the tasks that actually access them. The move_pages system call may be invoked manually, but it achieves limited throughput and requires strong collaboration from the application: the location of threads and their memory access patterns must be known precisely in order to decide when to migrate the right memory buffer at the right time. We present the implementation of a Next-Touch memory placement policy that enables automatic dynamic migration of pages when they are actually accessed by a task. We introduce a new PTE flag set up by madvise, and the corresponding Copy-on-Touch codepath in the page-fault handler, which allocates the new page near the accessing task. We then examine the performance and overheads of this model and compare it to using the move_pages system call.
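
    As an illustration of the manual approach the paper compares against, here is a minimal sketch of migrating a buffer with the move_pages system call; the page count and target node are arbitrary placeholders.

        /* Minimal sketch of manual page migration with move_pages(2).
         * Link with -lnuma; node 1 and the page count are placeholders. */
        #include <numaif.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        int main(void)
        {
            long page_size = sysconf(_SC_PAGESIZE);
            size_t count = 16;                  /* pages to migrate */
            char *buf = aligned_alloc(page_size, count * page_size);

            void *pages[16];
            int nodes[16], status[16];
            for (size_t i = 0; i < count; i++) {
                buf[i * page_size] = 0;         /* touch so the page exists */
                pages[i] = buf + i * page_size;
                nodes[i] = 1;                   /* target NUMA node */
            }

            /* pid 0 means the calling process. */
            if (move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE) != 0)
                perror("move_pages");
            return 0;
        }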

    Enabling High-Performance Memory Migration for Multithreaded Applications on Linux

    As the number of cores per machine increases, memory architectures are being redesigned to avoid bus contention and sustain higher throughput needs. The emergence of Non-Uniform Memory Access (NUMA) constraints has made affinities between threads and buffers an important decision criterion for schedulers. Memory migration lets work and data be redistributed jointly and dynamically across the machine, but it requires high-performance data transfers as well as a convenient programming interface. We present improvements to the Linux migration primitives and the implementation of a Next-Touch policy in the kernel to provide multithreaded applications with an easy way to dynamically maintain thread-data affinity. Microbenchmarks show that our work enables high-performance synchronous and lazy memory migration within multithreaded applications. A threaded LU factorization then reveals the large improvement that our Next-Touch policy may bring to applications with complex access patterns.
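
    A Next-Touch policy is typically exposed to the application as a single hint. The sketch below assumes a hypothetical MADV_NEXTTOUCH advice value (the name is invented for illustration and is not a mainline Linux constant): after the hint, each page migrates to the NUMA node of the first thread that touches it.

        /* Hypothetical Next-Touch usage; MADV_NEXTTOUCH is an assumed
         * name, not a mainline Linux advice value. */
        #include <sys/mman.h>
        #include <stddef.h>

        void redistribute_on_next_touch(void *buf, size_t len)
        {
        #ifdef MADV_NEXTTOUCH
            /* Mark the buffer: the kernel invalidates the mappings so the
             * next access faults, and the fault handler re-allocates the
             * page on the node of the touching thread (Copy-on-Touch). */
            madvise(buf, len, MADV_NEXTTOUCH);
        #endif
            /* Worker threads then simply access their part of the buffer;
             * pages migrate lazily toward them, with no move_pages call. */
        }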

    Finding a Tradeoff between Host Interrupt Load and MPI Latency over Ethernet

    Achieving high-performance message passing on top of generic Ethernet hardware suffers from the NIC's interrupt-driven model, where coalescing is usually involved. We present an in-depth study of the impact of interrupt coalescing on Open-MX performance. It shows that disabling coalescing is only relevant for small-message latency, not for most other metrics. Two new coalescing strategies are then presented to efficiently support both latency-friendly and coalescing-friendly workloads, by having the NIC look at Open-MX messages and streams before deciding when to raise interrupts. The implementation of these strategies in the firmware of Myri-10G NICs shows that Open-MX is now able to achieve low small-message latency, high large-message throughput, and a satisfying message rate without having to manually tune the coalescing delay for each benchmark. Real application evaluation further shows that our modifications improve the NAS Parallel Benchmark IS execution time by 7-8%, thanks to our NIC firmware raising up to 20% additional interrupts at the right time.
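
    The strategies themselves live in NIC firmware; the C sketch below only illustrates the kind of decision they make, raising an interrupt immediately at small-message boundaries while letting a coalescing timer cover the middle of large streams. All names and thresholds here are invented for the example.

        /* Illustrative message-aware coalescing decision (not the actual
         * Myri-10G firmware code); names and thresholds are invented. */
        #include <stdbool.h>
        #include <stdint.h>

        #define SMALL_MSG_MAX 4096  /* assumed latency-sensitive size */

        struct rx_event {
            uint32_t msg_len;   /* total length of the message */
            bool     last_frag; /* this event completes the message */
        };

        /* Raise an interrupt right away for completed small messages, so
         * their latency is not delayed by the coalescing timer; defer for
         * fragments in the middle of large messages, where throughput and
         * host interrupt load matter more than per-event latency. */
        static bool should_raise_interrupt(const struct rx_event *ev)
        {
            if (ev->last_frag && ev->msg_len <= SMALL_MSG_MAX)
                return true;   /* latency-friendly path */
            return false;      /* let the coalescing delay expire */
        }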

    Optimisation Mechanisms for MPICH/Madeleine

    This report presents optimisation mechanisms within MPICH/Madeleine, the implementation of MPICH over Madeleine. These mechanisms aim to decrease the communication time of derived datatypes, whose data is stored in noncontiguous memory areas. The report presents the mechanisms as well as a performance evaluation.
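
    For context, the derived datatypes in question describe noncontiguous layouts such as matrix columns. A standard MPI example (independent of MPICH/Madeleine) is sketched below; it assumes MPI_Init has already been called.

        /* Standard MPI derived-datatype example: sending one column of a
         * row-major matrix as a single noncontiguous message. */
        #include <mpi.h>

        #define ROWS 4
        #define COLS 8

        void send_column(double matrix[ROWS][COLS], int col, int dest)
        {
            MPI_Datatype column;
            /* ROWS blocks of 1 double, separated by a stride of COLS. */
            MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
            MPI_Type_commit(&column);

            MPI_Send(&matrix[0][col], 1, column, dest, 0, MPI_COMM_WORLD);

            MPI_Type_free(&column);
        }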

    MPICH/Madeleine Installer's, User's and Developer's Guide

    MPICH/Madeleine is a new free implementation of the MPI standard based on the MPICH implementation and the multi-protocol communication library called Madeleine. It aims to efficiently exploit clusters of clusters with heterogeneous networks. This manual presents an installer's, user's and developer's guide for MPICH/Madeleine. The latest version of this document is available from the following URL: http://runtime.futurs.inria.fr/mpi/manual/

    Evaluation of OpenMP Dependent Tasks with the KASTORS Benchmark Suite

    The recent introduction of task dependencies in the OpenMP specification provides new ways of synchronizing tasks. Application programmers can now describe the data a task will read as input and write as output, letting the runtime system resolve fine-grain dependencies between tasks to decide which task should execute next. Such an approach should scale better than the excessive global synchronization found in most OpenMP 3.0 applications. As promising as it looks, however, any new feature needs proper evaluation to encourage application programmers to embrace it. This paper introduces the KASTORS benchmark suite, designed to evaluate OpenMP task dependencies. We modified state-of-the-art OpenMP 3.0 benchmarks and data-flow parallel linear algebra kernels to make use of task dependencies. Learning from this experience, we propose extensions to the current OpenMP specification to improve the expressiveness of dependencies. We finally evaluate both the GCC/libGOMP and the CLANG/libIOMP implementations of OpenMP 4.0 on our KASTORS suite, demonstrating the interest of task dependencies compared to taskwait-based approaches.
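
    OpenMP 4.0 task dependencies are expressed with depend clauses; a minimal standard example (not code taken from the KASTORS suite) is shown below.

        /* Minimal OpenMP 4.0 task-dependency example (standard clauses,
         * not code from KASTORS). Compile with -fopenmp. */
        #include <stdio.h>

        int main(void)
        {
            int x = 0, y = 0;

            #pragma omp parallel
            #pragma omp single
            {
                #pragma omp task depend(out: x)   /* producer of x */
                x = 1;

                #pragma omp task depend(in: x) depend(out: y)
                y = x + 1;                        /* runs after the first task */

                #pragma omp task depend(in: y)    /* runs last, no taskwait */
                printf("y = %d\n", y);
            }
            return 0;
        }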

    NewMadeleine: a Fast Communication Scheduling Engine for High Performance Networks

    Communication libraries have made dramatic progress over the last fifteen years, pushed by the success of cluster architectures as the preferred platform for high-performance distributed computing. However, many potential optimizations are left unexplored in the process of mapping application communication requests onto low-level network commands. The fundamental cause of this situation is that the design of communication subsystems is mostly focused on reducing latency by shortening the critical path. In this paper, we present a new communication scheduling engine which dynamically optimizes application requests according to the NICs' capabilities and activity. The optimizing code is generic and portable, and the database of optimizing strategies may be dynamically extended.
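
    One classic optimization such an engine can apply is aggregating several pending small requests into a single network command. The sketch below is a generic illustration of that idea only; it is not NewMadeleine's actual API, and all names are invented.

        /* Generic illustration of request aggregation (not the actual
         * NewMadeleine API): pending small sends to the same destination
         * are packed into one network command while the NIC is busy. */
        #include <stddef.h>
        #include <string.h>

        #define PACK_MAX 8192

        struct send_req { const void *data; size_t len; };

        /* Pack as many queued requests as fit into one buffer; returns
         * the number of requests consumed. A scheduler would call this
         * while the NIC is still busy with a previous command, so the
         * aggregation costs no extra latency on the critical path. */
        size_t pack_requests(struct send_req *queue, size_t nreq,
                             char *pkt, size_t *pkt_len)
        {
            size_t used = 0, i;
            for (i = 0; i < nreq && used + queue[i].len <= PACK_MAX; i++) {
                memcpy(pkt + used, queue[i].data, queue[i].len);
                used += queue[i].len;
            }
            *pkt_len = used;
            return i;
        }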

    Dynamic Task and Data Placement over NUMA Architectures: an OpenMP Runtime Perspective

    Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid memory access penalties. Directive-based programming languages such as OpenMP provide programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into "scheduling hints" to solve thread/memory affinity issues. It enables dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. First experiments show that mixed solutions (migrating both threads and data) outperform Next-Touch-based data distribution policies and open possibilities for new optimizations.
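
    Such NUMA-aware placement relies on knowing the hardware topology. The sketch below uses the hwloc library (2.x API) to bind the calling thread to the cores of a given NUMA node; this is a plausible building block, not necessarily how the runtime described above does it internally.

        /* Binding a thread near its data with hwloc (2.x API). */
        #include <hwloc.h>

        int bind_thread_to_numa_node(int node_index)
        {
            hwloc_topology_t topo;
            hwloc_topology_init(&topo);
            hwloc_topology_load(topo);

            hwloc_obj_t node =
                hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, node_index);
            int err = -1;
            if (node)
                /* Restrict the calling thread to the cores of this node. */
                err = hwloc_set_cpubind(topo, node->cpuset,
                                        HWLOC_CPUBIND_THREAD);

            hwloc_topology_destroy(topo);
            return err;
        }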

    StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators

    GPUs have largely entered HPC clusters, as shown by the top entries of the latest Top500 list. Exploiting such machines is however very challenging, not only because two separate paradigms, MPI and CUDA or OpenCL, must be combined, but also because nodes are heterogeneous and thus require careful load balancing within the nodes themselves. Current paradigms are usually limited to offloading only parts of the computation while leaving CPUs idle, or they require static work partitioning between CPUs and GPUs. To handle single-node architecture heterogeneity, we previously proposed StarPU, a runtime system capable of dynamically scheduling tasks in an optimized way on such machines. We show here how the task paradigm of StarPU has been combined with MPI communications, and how we extended the task paradigm itself to allow mapping the task graph on MPI clusters so as to automatically achieve an optimized distributed execution. We show how a sequential-like Cholesky source code can be easily extended into a scalable distributed parallel execution, already exhibiting a speedup of 5 on 6 nodes.
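
    The programming style the abstract describes looks roughly like the following sketch, based on the public starpu_mpi_task_insert interface; exact signatures vary across StarPU versions, and the codelet and data registration details are simplified placeholders.

        /* Sketch of the StarPU-MPI style (illustration only; exact
         * signatures vary across StarPU versions). */
        #include <stdint.h>
        #include <starpu.h>
        #include <starpu_mpi.h>

        static void scal(void *buffers[], void *arg)
        {
            (void)arg;
            double *v = (double *)STARPU_VECTOR_GET_PTR(buffers[0]);
            unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
            for (unsigned i = 0; i < n; i++) v[i] *= 2.0;
        }

        static struct starpu_codelet cl = {
            .cpu_funcs = { scal }, .nbuffers = 1, .modes = { STARPU_RW },
        };

        int main(int argc, char **argv)
        {
            double vec[1024] = { 0 };
            starpu_data_handle_t h;

            starpu_init(NULL);
            starpu_mpi_init(&argc, &argv, 1);

            starpu_vector_data_register(&h, STARPU_MAIN_RAM,
                                        (uintptr_t)vec, 1024, sizeof(double));
            /* Give the handle an MPI tag and an owner rank; StarPU-MPI
             * then infers the needed transfers from the task graph. */
            starpu_mpi_data_register(h, /* tag */ 42, /* owner */ 0);

            /* Same call on every rank; data moves only as required. */
            starpu_mpi_task_insert(MPI_COMM_WORLD, &cl, STARPU_RW, h, 0);

            starpu_task_wait_for_all();
            starpu_data_unregister(h);
            starpu_mpi_shutdown();
            starpu_shutdown();
            return 0;
        }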

    StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators

    GPU clusters are becoming widespread HPC platforms. Exploiting them is however challenging, as this requires two separate paradigms (MPI and CUDA or OpenCL) and careful load balancing due to node heterogeneity. Current paradigms usually either limit themselves to offloading part of the computation and leave CPUs idle, or require static CPU/GPU work partitioning. We thus previously proposed StarPU, a runtime system able to dynamically schedule tasks within a single heterogeneous node. We show how we extended the task paradigm of StarPU with MPI to easily map the task graph onto MPI clusters and automatically benefit from optimized execution.